Overview

Questions

    1. How do the different habitats differ in taxonomic and functional diversity?
    2. Which taxa and functions are most associated with cryospheric environments? Do these differ between habitats?
    3. Can we accurately classify which samples belong to which site using high-level taxonomic ranks?

Hypothesis

  • We will observe greater taxonomic diversity between sites than functional diversity
    • There is a limited number of adaptations that microorganisms can evolve to survive in cryospheric environments, so we should expect a significant degree of convergent evolution towards the same pathways

Pipeline

Overview

  • The Nextflow pipeline used to process the reads first performs quality control, including trimming and contaminant removal.
    • Afterwards, it clusters reads at a 99% similarity threshold into operational taxonomic units (OTUs). The idea is that each OTU groups near-identical sequences, so it can be treated as a proxy for a single species
# A single FASTQ record spans four lines: identifier, sequence, separator (+) and per-base quality
example <- read.delim("./data/raw/Bihor_Mountain/SRR2998649.part-1.fastq")
show <- paste(c(example[4, ], example[5, ], example[6, ], example[7, ]), collapse = "\n")
message(show)
@SRR2998649.2 ITEWJBF02GOECX length=439
TACGAGGGTATCTAATCCGGTTCGCTCCCCACGCTTTCGTGCCTCAGTGTCAGAAATAGCCTAGTAACCTGCCTACGCCATTGGTGTTCCTTCTAATATCTACGGATTTCACTCCTACACTAGAAATTCCAGTTACCTCTGCTACTCTCGAGTTTGCCAGTTTGAATAATAGTCTGTGTGGTTGAGCCACCAGATTTCACCATTCACTTAACAAACCACCTACGCAACTCTTTACGCCCAGTCACTCCGGATAATGCTTGCACCCTACGTATGACCGCGGCTGCTGGCACGTAGTTAGCCGGTGCTTATTCATAAGTTACCGTCATATTCTTCACTTATAAAAGCAGTTTACGACCCGAAGGCCTTCATCCTGCACGCGGCGTTGCTCCATCAGACTTTCGTCCATTGTGGAAGATTCCTCACTGCTGCCTCCCGTAGG
+SRR2998649.2 ITEWJBF02GOECX length=439
IIIIIHHHIIIIIIIIIIIIIIIIII;;;;IIIIHHHIIIIIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHD666?IIHHHIIIIIIII???IIIIIIIIIIIIIIIIIIIIIIIIIIGGGIIIIIHHHIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIIIHHHIIII;;;HHHIIIIIIIIIIIIIIHGGHIIIHHHIIIIIIIIIIIIIIIIICCCIH;<<<DIIIIIIIIHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIIIIHHHIIIIIIHHHHIIIIIIIIIIIIICCCBDDIIHHHIIIIHHHIIIIIIIIIIIIHHHIIIIIIIIIIIIIHDCCHIIID@@CCCEE@@@C@@@?@@@@?9<<CC?B@@@EEEEEEE===E7777
  • Input:
    • Raw reads in FastQ format: 13 sites, 10 samples each
      • A non-cryoconic sample (“New Zealand soil”) was added as an “outgroup” comparison
    • Databases
    • Pre-trained classifier model
  • Output:
    • OTU frequency tables
    • Taxonomic classifications of each OTU
    • Pathway frequency table
    • Rooted phylogenetic trees

Importing data from fastq

Qiime2 uses a compressed file format called an ‘Artifact’ for its analyses. Artifacts have different semantic types, e.g. FeatureData[Sequence] or Phylogeny[Unrooted], depending on the type of data they contain. To begin the analysis, the fastq files were imported into SampleData[SequencesWithQuality] (single-end) or SampleData[PairedEndSequencesWithQuality] (paired-end) artifacts

  • Although all of these reads were prepared with Illumina devices, sequencing quality can vary between sequencing centres, meaning that each sample will likely need specific parameters for cleaning.
    • This means that each sample needs to be imported into separate artifacts, then merged once they have been cleaned.
  • Quality \((Q)\) is commonly measured in Phred scores, defined as \(Q=-10 \log_{10}P\), where \(P\) is the probability of an incorrect base call. Higher Phred scores therefore indicate a lower probability of an erroneous base: \(Q=20\) corresponds to a 1% error probability and \(Q=30\) to 0.1%. Every base position is given a Phred score, and it is common to see the score decrease towards the end of the read
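
To make the Phred definition concrete, here is a minimal Python sketch (not part of the pipeline) that converts between scores and error probabilities and decodes an ASCII quality string. The function names are my own, and the default offset of 33 is an assumption that matches the Phred33 manifests; Phred64 data would need `offset=64`.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Decode an ASCII-encoded FASTQ quality string into Phred scores."""
    return [ord(char) - offset for char in quality]


# 'I' is ASCII 73, ';' is 59 and '7' is 55, so with offset 33 the scores are 40, 40, 26, 22
scores = decode_quality("II;7")
for score in scores:
    print(score, phred_to_error_prob(score))
```

The long runs of `I` in the example read above therefore correspond to a score of 40, and the `7777` tail to a score of 22, which illustrates the drop in quality towards the end of the read.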
# Relevant commands
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path devon.tsv \
  --output-path devonFQ.qza \
  --input-format PairedEndFastqManifestPhred64V2

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path neem.tsv \
  --output-path neem.qza \
  --input-format SingleEndFastqManifestPhred33V2

Database preparation

  • Two databases were obtained:

The MicFunPred and NCBI databases were obtained in the standard BLAST format and needed to be converted into compatible data types for import into the qiime2 workflow. Specifically, I needed to convert them into FASTA format with an associated taxonomy mapping file (in HeaderlessTSVTaxonomyFormat)

  • Steps
      1. Extract all entries from the database in FASTA format
      2. Extract the headers and convert them into HeaderlessTSVTaxonomyFormat, a tab-delimited file of FASTA identifiers followed by their taxonomic assignments
      3. Remove the taxonomic assignments from the original FASTA file
      4. Concatenate the respective file types together, then import the files as qiime2 Artifacts
      5. Repeat for the other database (if both databases had used the same identifier conventions I could have combined them and processed them together, but unfortunately this was not the case)
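
Steps 2 and 3 can also be sketched in Python. This is only an illustration that mirrors the `grep`/`sed` approach for headers of the form `identifier description` (the NCBI convention, with a space as separator); the function name and the toy record are hypothetical, and the MicFunPred headers would need `_` as the separator instead.

```python
def split_fasta(fasta_text: str, separator: str = " ") -> tuple[str, str]:
    """Split FASTA text into (FASTA with bare identifiers, headerless TSV mapping).

    Each header is cut at the first separator: the left part is the identifier,
    the right part is the annotation, with remaining spaces turned into underscores.
    """
    fasta_lines = []
    mapping_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            identifier, _, annotation = line[1:].partition(separator)
            mapping_lines.append(f"{identifier}\t{annotation.replace(' ', '_')}")
            fasta_lines.append(f">{identifier}")
        else:
            fasta_lines.append(line)
    return "\n".join(fasta_lines) + "\n", "\n".join(mapping_lines) + "\n"


# Toy record with a made-up identifier, purely for illustration
toy = ">ID_1 Genus species strain X\nACGTACGT\n"
stripped_fasta, mapping = split_fasta(toy)
print(stripped_fasta, end="")
print(mapping, end="")
```

Doing both steps in a single pass keeps the identifiers in the FASTA and in the mapping file in sync, which is what the import into qiime2 relies on.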
# The database of identifiers
database_example <- read_tsv("./data/blastdb/all_uniqIDs.txt")
Rows: 130121 Columns: 2
── Column specification ──────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): 100000, Bacteria;Firmicutes;Erysipelotrichi;Erysipelotrichales;Erys...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
database_example
# 1
blastdbcmd -db NCBI_16S/16S_ribosomal_RNA -entry all > NCBI_16S.fasta
blastdbcmd -db micfun/micfun16S -entry all > micfun.fasta

# 2
grep '>' NCBI_16S.fasta | tr -d '>' | sed 's/ /\t/' | sed 's/ /_/g' > NCBI_16SID.txt
grep '>' micfun.fasta | tr -d '>' | sed 's/_/\t/' | sort | uniq > micfunID.txt # Unfortunately several of the headers repeat

# 3
cat micfun.fasta | sed 's/_.*//' > micfunID.fasta
cat NCBI_16S.fasta | sed 's/ .*//' > ncbi16sID.fasta

# 4
cat micfunID.txt NCBI_16SID.txt EzBioCloud/ezbiocloud_id_taxonomy.txt > all_mappings.txt
# Use the FASTA files whose headers were reduced to bare identifiers in step 3
cat EzBioCloud/ezbiocloud_qiime_full.fasta ncbi16sID.fasta micfunID.fasta > all.fasta
# Sequences are imported as FeatureData[Sequence]; only the mapping file is FeatureData[Taxonomy]
qiime tools import --type 'FeatureData[Sequence]' --input-path all.fasta --output-path all_fasta.qza
qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --input-path all_uniqIDs.txt --output-path Ids.qza
  • Altogether, the database contains 130,122 sequences (though there might be some repetition that was overlooked)
# Script for removing redundant ids
import csv

from Bio import SeqIO

# Keep only the first FASTA record seen for each identifier
seen_seqs: set[str] = set()
with open('all_uniq.fasta', 'w') as mapped:
    for seq in SeqIO.parse('all.fasta', 'fasta'):
        if seq.id in seen_seqs:
            continue
        seen_seqs.add(seq.id)
        mapped.write(f'>{seq.id}\n')
        mapped.write(f'{seq.seq}\n')

# Keep only the first taxonomy mapping seen for each identifier
seen_ids: set[str] = set()
with open('all_mappings.txt', 'r', newline='') as src, \
        open('all_uniqIDs.txt', 'w') as uniq:
    for row in csv.reader(src, delimiter='\t'):
        if row[0] in seen_ids:
            continue
        seen_ids.add(row[0])
        uniq.write(f'{row[0]}\t{row[1]}\n')

Diversity analyses

Alpha diversity

alpha <- get_artifact_data("./results/7-Diversity",
  id_key,
  extension = "",
  metric_list = alpha_metrics
)
faith <- data.frame(row.names = 1:10)
for (site in names(id_key)) {
  faith[[id_key[[site]]]] <- alpha[[site]]$fa[, 2]
}
faith_div_plot <- faith %>%
  gather() %>%
  ggplot(aes(x = value, y = key, fill = key)) +
  ggridges::geom_density_ridges2() +
  guides(fill = "none") +
  labs(y = "Site", x = "Faith diversity")
faith_div_plot
Picking joint bandwidth of 2.34

  • Evenness quantifies the distribution of species in a site; the higher the evenness, the more balanced the number of each species.
  • Low evenness indicates that a site consists mainly of a few very common species
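
As a rough illustration of what the evenness values mean, the sketch below computes Pielou's evenness, \(J = H'/\ln S\), where \(H'\) is the Shannon entropy and \(S\) the number of observed species. It is a hand-rolled stand-in rather than the qiime2 implementation, and the two count vectors are invented.

```python
import math


def pielou_evenness(counts: list[int]) -> float:
    """Pielou's evenness J = H' / ln(S), using only species with non-zero counts."""
    present = [count for count in counts if count > 0]
    richness = len(present)
    if richness < 2:
        return math.nan  # J is undefined when fewer than two species are present
    total = sum(present)
    shannon = -sum((c / total) * math.log(c / total) for c in present)
    return shannon / math.log(richness)


balanced = [25, 25, 25, 25]  # every species equally common
dominated = [97, 1, 1, 1]    # one species makes up almost the whole sample
print(pielou_evenness(balanced))
print(pielou_evenness(dominated))
```

The balanced community reaches the maximum of 1, while the dominated one falls close to 0, which is the pattern to look for in the box plot.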

pi_evenness <- data.frame(row.names = 1:10)
for (site in names(id_key)) {
  pi_evenness[[id_key[[site]]]] <- alpha[[site]]$pi[, 1]
}
even_plot <- pi_evenness %>%
  gather() %>%
  ggplot(aes(x = key, y = value, fill = key)) +
  geom_boxplot() +
  guides(fill = "none") +
  labs(x = "Site", y = "Evenness", title = "Pielou evenness")
even_plot

It’s pretty clear that some sites have much lower evenness than others. Others, like the Barrow mountain sites, Bihor mountains and Catriona snow, seem similar from the box plot. Whether or not the difference in evenness between these sites is statistically significant can be tested with a Kruskal-Wallis test

pi_evenness %>%
  select(c(
    "Barrow mountain high", "Barrow mountain low",
    "Bihor mountains", "Catriona snow"
  )) %>%
  kruskal.test()

    Kruskal-Wallis rank sum test

data:  .
Kruskal-Wallis chi-squared = 3.8824, df = 3, p-value = 0.2744

With p = 0.27 we cannot reject the null hypothesis at the 5% level, so there is no evidence that evenness differs among these four sites.

Beta diversity

Beta diversity quantifies the distance/dissimilarity between sites and is measured on a scale of 0 (identical) to 1 (completely different).

But why use different metrics?

  • Besides total species counts, two other sets of information can be used in the diversity calculation
    • Abundance: the individual counts for each species e.g. 100 observations of species A, 20 of B
    • Phylogenetic distance: the evolutionary relationships between each species
  • Metrics differ mainly by which information they include
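
To show how the choice of information changes the answer, here is a small Python sketch comparing Jaccard distance (presence/absence only) with Bray-Curtis dissimilarity (which uses abundances). The two count vectors are made up, and the functions are simplified versions of the textbook formulas, not the qiime2 code.

```python
def jaccard_distance(a: list[int], b: list[int]) -> float:
    """1 - (shared species / species present in either sample)."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a: list[int], b: list[int]) -> float:
    """Sum of absolute count differences divided by the total count."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Counts of species A, B and C at two invented sites
site_1 = [100, 20, 0]
site_2 = [10, 200, 5]
print(round(jaccard_distance(site_1, site_2), 3))  # 2 of 3 species shared
print(round(bray_curtis(site_1, site_2), 3))       # abundances differ strongly
```

The two sites look fairly similar under Jaccard because they share two of three species, but very different under Bray-Curtis because the dominant species are swapped.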

But what is phylogenetic distance?

otu_freqs <- lapply(
  get_artifact_data("./results/2-OTUs", id_key, "otuFreqs"),
  as.data.frame
)
fasttree <- get_artifact_data(
  # Approximate maximum likelihood trees, quick and useful for testing data
  "./results/6-RootedTrees", id_key,
  "FastTree_RootedTree"
)
iqtree <- get_artifact_data(
  # Real maximum likelihood trees, accurate, but slow
  "./results/6-RootedTrees", id_key,
  "IQTREE_RootedTree"
)
# Annotate an example tree
matched <- match(iqtree$GrI$tip.label, rownames(otu_freqs$GrI))
freq_mapping <- otu_freqs$GrI[matched, ] %>%
  rowMeans() %>%
  as.data.frame() %>%
  `rownames<-`(seq_along(rownames(.)) %>% paste("OTU", ., sep = ""))
iqtree$GrI$tip.label <- paste(rownames(freq_mapping), "freq =", freq_mapping[[1]])
sample_tree <- iqtree$GrI %>%
  ggtree(layout = "roundrect", aes(color = "#A4E473")) +
  geom_tiplab(size = 3, color = "#004651") +
  geom_tippoint(color = "#66CC8A") +
  labs(title = "Phylogenetic tree of Cryoconite samples") +
  theme(legend.position = "none", axis.text = element_text(size = 14))
sample_tree

Ordination

Beta diversity calculations on multiple sites at once return a distance matrix, where the rows and columns are sites and the entries are the pairwise distances between them. This can be depicted using ordination methods such as principal coordinates analysis (PCoA)
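
For intuition about what PCoA does with a distance matrix, the following dependency-free Python sketch double-centres the squared distances and extracts only the first principal coordinate by power iteration. It is a teaching sketch under two assumptions: the function names are mine, and the dominant eigenvalue is positive (true for Euclidean-like distances). The actual ordinations in this report come from the pipeline output and `wcmdscale`.

```python
import math


def gower_center(dist: list[list[float]]) -> list[list[float]]:
    """Double-centre -0.5 * d^2 so that rows and columns sum to zero."""
    n = len(dist)
    sq = [[-0.5 * d * d for d in row] for row in dist]
    row_means = [sum(row) / n for row in sq]
    grand_mean = sum(row_means) / n
    return [
        [sq[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def leading_axis(b: list[list[float]], iterations: int = 500) -> list[float]:
    """First principal coordinate: dominant eigenvector scaled by sqrt(eigenvalue)."""
    n = len(b)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return [0.0] * n
        vec = [x / norm for x in product]
        eigenvalue = norm
    return [x * math.sqrt(eigenvalue) for x in vec]


# Three points on a line at positions 0, 1 and 3
distances = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
axis = leading_axis(gower_center(distances))
print(axis)
```

Because the toy points lie on a line, the single axis reproduces the original distances exactly (up to sign and shift); real community data need two or more axes and lose some information in the projection.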

beta <- get_artifact_data("./results/7-Diversity",
  list(Merged = NULL),
  extension = "",
  metric_list = beta_metrics
)
pcoa2D <- get_artifact_data("./results/8-Analysis",
  list(Merged = NULL),
  extension = "PCOA-2D_",
  metric_list = beta_metrics
)
pcoa2D_merged <- lapply(pcoa2D$Merged, metadata_merge_pcoa, metadata = metadata)

pcoaja <- plot_pcoa(pcoa2D_merged$ja, "Location") +
  labs(x = "PC1", y = "PC2", title = "Jaccard distance")

pcoabc <- plot_pcoa(pcoa2D_merged$bc, "Location") +
  labs(x = "PC1", y = "PC2", title = "Bray Curtis")

pcoauu <- plot_pcoa(pcoa2D_merged$uu, "Location") +
  labs(x = "PC1", y = element_blank(), title = "Unweighted Unifrac")

pcoawn <- plot_pcoa(pcoa2D_merged$wn, "Location") +
  labs(x = "PC1", y = element_blank(), title = "Weighted normalized Unifrac")

pcoa_arrange <- ggarrange(pcoabc, pcoauu, pcoawn,
  ncol = 3,
  common.legend = TRUE,
  legend = "right"
)
pcoa_arrange

  • The PCoA plots show clearly how the choice of distance metric affects the clustering of samples.
  • Clustering in the Bray Curtis plot represents sites that share many of the same species at similar abundances. The big cluster falls apart once we factor in the evolutionary distance between OTUs, as shown by the Unifrac metrics.
  • This implies that although some taxa are shared, the unique taxa are evolutionarily distant from the shared ones (essentially, they have very different DNA).
  • Once we add weightings by abundance, though, new clusters form, indicating that there are many more common taxa within the groups than there are unique taxa.
  • Even partitioning sites by habitat type (glacier and permafrost) does not help

What to choose?

# Filter the sites
keep <- c("ViS", "BrL", "BrH", "SvG")
filter_wn <- filter_dm(beta$Merged$wn, keep)
filtered_meta <- filter_meta(metadata, keep)

PERMANOVA & ANOSIM

Within the chosen cluster, it is visually unclear whether or not the differences between sites are significant. There are specific hypothesis tests for this problem, such as Permutational Multivariate Analysis of Variance (PERMANOVA) and Analysis of Similarities (ANOSIM)

  • These allow you to say with confidence whether or not certain variables (in this case location) are associated with a statistically significant difference in species composition
    • Both PERMANOVA and ANOSIM test the null hypothesis that within-group distances are no different from between-group distances.
    • With this dataset, rejecting the null hypothesis means concluding that between-group distances differ from within-group distances, i.e. that species composition differs significantly between locations.
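
The ANOSIM idea can be sketched in a few lines of Python. The statistic is \(R = (\bar r_B - \bar r_W) / (N(N-1)/4)\), where \(\bar r_B\) and \(\bar r_W\) are the mean ranks of between-group and within-group distances and \(N\) is the number of samples. This is an illustrative reimplementation with invented data, not the `vegan` code; the p-value convention `(hits + 1) / (permutations + 1)` is my understanding of the usual approach and should be treated as an assumption.

```python
import itertools
import random


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(dist: list[list[float]], groups: list[str]) -> float:
    """ANOSIM R: +1 means groups are fully separated, about 0 means no structure."""
    n = len(groups)
    pairs = list(itertools.combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (n * (n - 1) / 4)


def anosim_p_value(dist, groups, permutations: int = 999, seed: int = 0) -> float:
    """Permutation p-value: how often shuffled labels give an R at least as large."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)


# Invented distances: samples 0-1 form group A, samples 2-3 form group B
toy_dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.85],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.85, 0.2, 0.0],
]
toy_groups = ["A", "A", "B", "B"]
print(anosim_r(toy_dist, toy_groups))
print(anosim_p_value(toy_dist, toy_groups))
```

Under this convention the smallest p-value obtainable with 999 permutations is 1/1000, so a reported significance of 0.001 means that no permuted labelling did as well as the real one.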

TODO:what is the difference between them?

adonis2(filter_wn ~ filtered_meta$Location)
anosim(filter_wn, grouping = filtered_meta$Location)

Call:
anosim(x = filter_wn, grouping = filtered_meta$Location) 
Dissimilarity: 

ANOSIM statistic R: 0.3687 
      Significance: 0.001 

Permutation: free
Number of permutations: 999

Pathway analysis

# Export for the FAPROTAX database
otu_genus <- list()
for (id in names(id_key)) {
  otu_genus[[id]] <- to_genus_csv(otu_freqs[[id]], blast[[id]])
}
genus_combined <- combine_freqs(otu_genus, taxon)
write.csv(genus_combined, "genus_otu_tables.csv", row.names = FALSE)
ko_all <- ko %>%
  # Merge the PICRUSt2 tables
  reduce(merge, by = "pathway", all = TRUE) %>%
  as_tibble() %>%
  replace(is.na(.), 0) %>%
  rel_abund(., pathway) %>% # Convert to relative abundances
  as_tibble()
message(glue("There are {dim(ko_all)[1]} inferred pathways"))
There are 438 inferred pathways
ko_xfunc <- ko_all %>% sites_x_func() # Transpose into sites x function format
ko_all
# Compute bray curtis distance, then plot pcoa
bc_func <- vegdist(ko_xfunc, method = "bray")
pcoa_bc_func <- bc_func %>%
  wcmdscale(k = 2) %>%
  metadata_merge_pcoa(metadata, ., functions = TRUE)
Joining with `by = join_by(sample.id)`
# Compute jaccard distance
ja_func <- vegdist(ko_xfunc, method = "jaccard")

pcoa_ja_func <- ja_func %>%
  wcmdscale(k = 2) %>%
  metadata_merge_pcoa(metadata, ., functions = TRUE)
Joining with `by = join_by(sample.id)`
# Plot and compare with ordination on taxonomy

plot_ja_func <- plot_pcoa(pcoa_ja_func, "Location", functions = TRUE) +
  labs(
    x = element_blank(), y = element_blank(), title = "Jaccard distance",
    subtitle = "From biological pathways"
  )

plot_bc_func <- plot_pcoa(pcoa_bc_func, "Location", functions = TRUE) +
  labs(
    x = "PC1", y = element_blank(), title = "Bray Curtis",
    subtitle = "From biological pathways"
  )

func_compare <- ggarrange(pcoaja + labs(x = element_blank()), plot_ja_func, pcoabc, plot_bc_func,
  ncol = 2, nrow = 2, common.legend = TRUE, legend = "bottom"
) + theme(axis.text = element_text(size = 14))

func_compare

Taxonomic classifications

For sample taxonomic classification, I will be trying out all three of the methods available in qiime2:

blast <- lapply(
  get_artifact_data("./results/3-Classified", id_key, "BLAST_All"),
  parse_taxonomy
)
blast$CaS # The classifications for the Catriona snow site
sklearn <- lapply(
  get_artifact_data("./results/3-Classified", id_key, "Sklearn"),
  parse_taxonomy
)
sk_merged <- read_qza("./results/3-Classified/Merged-Sklearn.qza")$data %>%
  parse_taxonomy()
count_identified(sklearn, "Sklearn")
Sklearn: 0.584407164275873
count_identified(blast, "BLAST")
BLAST: 0.579451912055851
sk_merged

The sklearn classifier scores slightly higher on identifications (0.584 versus 0.579 for BLAST), so it will be used for all downstream analyses

Abundance visualizations

Phyla

all <- read_qza("./results/2-OTUs/Merged-otuFreqs.qza")$data
sk_merged <- read_qza("./results/3-Classified/Merged-Sklearn.qza")$data %>%
  parse_taxonomy()

ranks <- merge_with_id(all, sk_merged, level = 2) %>%
  # Collapse taxonomy into phyla
  filter(!(is.na(taxon))) %>%
  group_by(taxon) %>%
  summarise(across(everything(), sum))

not_bacteria <- c(
  "Arthropoda", "Nanoarchaeota", "Diatomea", "Altiarchaeota",
  "Ascomycota", "Basidiomycota", "Cercozoa", "Ciliophora", "Asgardarchaeota",
  "Phragmoplastophyta", "Euryarchaeota", "Crenarchaeota"
) # These are most likely false positives given the specificity of the 16S rRNA primers used to sequence the samples

tax_sum <- sum_by_site(ranks, id_key, "taxon", not_bacteria)

stacked <- tax_sum %>% ggplot(., aes(x = name, y = value, fill = identifier)) +
  geom_bar(stat = "identity") +
  scale_fill_discrete(name = "Phylum") +
  scale_color_paletteer_d("pals::glasbey") +
  labs(
    x = "Site", y = "Relative abundance", title = "Phyla relative abundance",
    subtitle = "*Putative phyla and false positives
  (non-bacteria) removed"
  ) +
  theme(axis.text = element_text(size = 14))
stacked

Pathways

shown_paths <- c(
  "CHLOROPHYLL-SYN", "GLYCOLYSIS", "TCA", "CALVIN-PWY",
  "PENTOSE-P-PWY", "METHANOGENESIS-PWY", "DENITRIFICATION-PWY", "FERMENTATION-PWY",
  "LACTOSECAT-PWY", "METH-ACETATE-PWY"
)
nice_paths <- ko_all %>%
  filter((grepl(paste(shown_paths, collapse = "|"), pathway)))
path_sum <- sum_by_site(nice_paths, id_key, "pathway", NaN)
heat_path <- path_sum %>% ggplot(., aes(x = name, y = identifier, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(
    name = "Relative abundance",
    mid = "seagreen1", low = "springgreen", high = "seagreen"
  ) +
  labs(
    x = "Site", y = "Pathway",
  )
heat_path

Differential abundance analysis

# Differentially abundant taxa between cryosphere types
abc <- ancombc2(
  data = tse, assay_name = "counts", tax_level = chosen_rank,
  fix_formula = var, group = var,
  pairwise = TRUE
)
lfc <- prepare_abc_lfc(abc, "Type", "res_pair", chosen_rank, false_positives)
# Test takes some time to run, so save results
write.csv2(abc$res_pair, "./results/8-ANCOM-BC/all_taxon_res.csv", row.names = FALSE)
write.csv2(lfc, "./results/8-ANCOM-BC/taxon_lfc.csv", row.names = FALSE)
abc_paths <- ancombc2(
  data = tse_paths, assay_name = "counts", tax_level = "Species",
  fix_formula = "Type", pairwise = TRUE, group = "Type"
)
path_lfc <- prepare_abc_lfc(abc_paths, "Type", "res", NA, NA)
write.csv2(abc_paths$res_pair, "./results/8-ANCOM-BC/all_path_res.csv", row.names = FALSE)
write.csv2(path_lfc, "./results/8-ANCOM-BC/path_lfc.csv", row.names = FALSE)

Differentially abundant taxa between sites

abc_taxon <- read.csv2("./results/8-ANCOM-BC/all_taxon_res.csv")
taxon_all_counts <- abc_taxon %>%
  ancombc_select(glue("diff_{var}"), chosen_rank, false_positives) %>%
  select(taxon) %>%
  unique() %>%
  dim() # Collect counts of all taxa
lfc <- read.csv2("./results/8-ANCOM-BC/taxon_lfc.csv")

percent_abund <- ((length(unique(lfc$taxon)) / taxon_all_counts[1]) %>% round(digits = 2)) * 100
print(glue("Percent of differentially abundant classes: {percent_abund}%"))
Percent of differentially abundant classes: 66%
lfc <- lfc %>% quartile_filter()
# We keep the taxa with the highest and lowest log-fold changes between the types
tax_ids <- as.character(1:length(lfc$taxon))
lfc$taxon <- paste(c(glue("{tax_ids}:")), lfc$taxon)

abc_plot <- abc_lfc_plot(lfc) + scale_fill_discrete(name = "Class") + labs(x = "Type") +
  geom_label(aes(label = tax_ids),
    label.size = 0.15,
    position = position_dodge(width = .9), show.legend = FALSE
  )

most_abund <- lfc %>%
  filter(lfc == max(lfc)) %>%
  select(taxon)
message(glue("The class with the largest log-fold change is {most_abund}"))
The class with the largest log-fold change is 51: KD4-96
abc_taxon
abc_plot

TODO: Talk about the top three most abundant classes and pathways.

KD4-96 is an uncharacterized class of the phylum Chloroflexi.

Plotting pathways

all_path_lfc <- read.csv2("./results/8-ANCOM-BC/all_path_res.csv")
path_lfc <- read.csv2("./results/8-ANCOM-BC/path_lfc.csv")

path_all_counts <- all_path_lfc %>%
  ancombc_select(glue("diff_{var}"), NA, NA) %>%
  select(taxon) %>%
  unique() %>%
  dim()
percent_abund <- ((length(unique(path_lfc$taxon)) / path_all_counts[1]) %>% round(digits = 2)) * 100
path_all_counts
[1] 419   1
print(glue("Percent of differentially abundant pathways: {percent_abund}"))
Percent of differentially abundant pathways: 33
path_lfc <- path_lfc %>% quartile_filter()
ids <- as.character(1:length(path_lfc$taxon))
path_lfc$taxon <- paste(c(glue("{ids}:")), path_lfc$taxon)

abc_pathways <- abc_lfc_plot(path_lfc) +
  scale_fill_discrete(name = "Pathway") +
  geom_label(aes(label = ids),
    label.size = 0.15,
    position = position_dodge(width = .9), show.legend = FALSE
  )
most_abund <- path_lfc %>%
  filter(lfc == max(lfc)) %>%
  select(taxon)
message(glue("The most differentially abundant pathway is {most_abund}"))
The most differentially abundant pathway is 49: PWY0-1338
all_path_lfc
abc_pathways

Interpretation

  • ANCOM-BC performs pairwise comparisons of the abundance of a given taxon in one group versus another
  • It returns five metrics for each OTU:
    • Log fold change of the OTU
    • W-score: the log fold change divided by its standard error. This is ANCOM-BC’s test statistic
    • p value: indicates whether or not the log fold change for a given OTU is statistically significant
    • q value: the p value adjusted for multiple testing
    • Standard error
  • Note that this W differs from the W of the original ANCOM method, which counts the number of times the null hypothesis is rejected for species i. In ANCOM-BC, a larger absolute W means stronger evidence that a taxon is differentially abundant between the two groups being compared.
    • The sign of W gives the direction: positive values mean higher abundance relative to the reference group, negative values mean lower abundance
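
The relationship between these metrics can be sketched as follows. The numbers are invented, the normal approximation for the p-value is a simplification, and the Holm correction is only an assumed choice of adjustment; the method actually used is whatever was configured in the `ancombc2` call.

```python
import math


def w_statistic(lfc: float, se: float) -> float:
    """Test statistic: log fold change divided by its standard error."""
    return lfc / se


def two_sided_p(w: float) -> float:
    """Two-sided p-value for W under a standard normal approximation."""
    return math.erfc(abs(w) / math.sqrt(2))


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[index] = running_max
    return adjusted


# Invented (log fold change, standard error) pairs for three taxa
estimates = [(2.4, 0.6), (-1.5, 0.5), (0.2, 0.4)]
w_values = [w_statistic(lfc, se) for lfc, se in estimates]
p_values = [two_sided_p(w) for w in w_values]
q_values = holm_adjust(p_values)
for w, p, q in zip(w_values, p_values, q_values):
    print(f"W = {w:+.2f}, p = {p:.4f}, q = {q:.4f}")
```

A taxon with a small log fold change can still have a large W if its standard error is small, which is why W rather than the raw log fold change is the test statistic.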

Random forest for site prediction

bad_names <- ranks[grep("-", ranks$taxon), ]$taxon
new_names <- bad_names %>% gsub("-", "_", .)

phyla_abund <- ranks %>%
  rel_abund() %>%
  t() %>%
  as.data.frame() %>%
  `colnames<-`(.[1, ]) %>%
  filter(!(grepl("NzS", rownames(.)))) %>%
  select(!all_of((not_bacteria))) %>%
  rename_with(~ "Marinimicrobia", "Marinimicrobia_(SAR406_clade)") %>%
  rename_with(~ new_names, bad_names) %>%
  slice(-1) %>%
  rownames_to_column("pred") %>%
  mutate_at(.vars = c(1:length(colnames(.)))[-1], .funs = as.numeric) %>%
  as_tibble()

phyla_train <- phyla_abund %>%
  mutate(pred = lapply(pred, function(x) {
    return(filter(metadata, sample.id == x)$Type)
  })) %>%
  mutate(pred = unlist(pred)) %>%
  mutate(pred = as.factor(pred)) # The response (cryosphere type) must be a factor for classification

num_features <- length(colnames(phyla_train))
message(glue("Trained on {num_features} features"))
Trained on 54 features
# Train the forest
set.seed(2002)
preds <- sample(2, nrow(phyla_train), replace = TRUE, prob = c(0.7, 0.3))
train <- phyla_train[preds == 1, ]
test <- phyla_train[preds == 2, ]
rf <- randomForest(pred ~ ., data = train, proximity = TRUE)
print(rf)

Call:
 randomForest(formula = pred ~ ., data = train, proximity = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of  error rate: 3.53%
Confusion matrix:
           Glacier Other Permafrost class.error
Glacier         27     0          0  0.00000000
Other            0    35          1  0.02777778
Permafrost       0     2         20  0.09090909
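# Note: this predicts on the training data itself, so near-perfect accuracy is expected;
# the held-out `test` set (or the OOB estimate above) gives the honest error estimate
try <- predict(rf, train)
confusionMatrix(try, train$pred)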
Confusion Matrix and Statistics

            Reference
Prediction   Glacier Other Permafrost
  Glacier         27     0          0
  Other            0    36          0
  Permafrost       0     0         22

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9575, 1)
    No Information Rate : 0.4235     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: Glacier Class: Other Class: Permafrost
Sensitivity                  1.0000       1.0000            1.0000
Specificity                  1.0000       1.0000            1.0000
Pos Pred Value               1.0000       1.0000            1.0000
Neg Pred Value               1.0000       1.0000            1.0000
Prevalence                   0.3176       0.4235            0.2588
Detection Rate               0.3176       0.4235            0.2588
Detection Prevalence         0.3176       0.4235            0.2588
Balanced Accuracy            1.0000       1.0000            1.0000
varImpPlot(rf)
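
As a sanity check on the out-of-bag (OOB) numbers printed by `print(rf)`, the error rates can be recomputed by hand from the OOB confusion matrix. The Python sketch below copies that matrix from the output above (rows are the true classes); the helper names are my own.

```python
# OOB confusion matrix copied from the print(rf) output (rows = true class)
classes = ["Glacier", "Other", "Permafrost"]
confusion = [
    [27, 0, 0],
    [0, 35, 1],
    [0, 2, 20],
]


def class_errors(matrix: list[list[int]]) -> list[float]:
    """Per-class error: share of each row that falls off the diagonal."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_error(matrix: list[list[int]]) -> float:
    """Overall error: share of all samples that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1 - correct / total


print(f"OOB error: {overall_error(confusion):.2%}")  # 3 of 85 samples misclassified
for name, error in zip(classes, class_errors(confusion)):
    print(f"{name}: {error:.8f}")
```

Three of the 85 training samples are misclassified out of bag, which reproduces the reported 3.53%, and all of the confusion is between the Other and Permafrost classes.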

References